时空视频接地(STVG)的重点是检索由自由形式的文本表达式描绘的特定物体的时空管。现有方法主要将这一复杂的任务视为平行框架的问题,因此遭受了两种类型的不一致缺点:特征对齐不一致和预测不一致。在本文中,我们提出了一个端到端的一阶段框架,称为时空的一致性变压器(STCAT),以减轻这些问题。特别是,我们引入了一个新颖的多模式模板,作为解决此任务的全球目标,该目标明确限制了接地区域并将所有视频框架之间的预测联系起来。此外,为了在足够的视频文本感知下生成上述模板,提出了一个编码器架构来进行有效的全局上下文建模。由于这些关键设计,STCAT享有更一致的跨模式特征对齐和管预测,而无需依赖任何预训练的对象探测器。广泛的实验表明,我们的方法在两个具有挑战性的视频基准(VIDSTG和HC-STVG)上胜过先前的最先进的,这说明了拟议框架的优越性,以更好地理解视觉与自然语言之间的关联。代码可在\ url {https://github.com/jy0205/stcat}上公开获得。
translated by 谷歌翻译
设计私人投票规则是值得信赖的民主的重要问题。在本文中,根据差异隐私的框架,我们根据知名的Condorcet方法提出了三类随机投票规则:Laplacian Condorcet方法($ cm^{lap} _ \ lambda $),指数condorcet方法($ cmcmential condorcet方法^{exp} _ \ lambda $)和随机响应condorcet方法($ cm^{rr} _ \ lambda $),其中$ \ lambda $代表噪声级别。通过准确估计随机性引入的错误,我们表明$ cm^{exp} _ \ lambda $是大多数情况下最准确的机制。我们证明,我们的所有规则都满足绝对单调性,Lexi参与,概率帕累托效率,近似概率孔孔标准和近似SD-StrategyProofness。此外,$ cm^{rr} _ \ lambda $满足(非适当的)概率condorcet标准,而$ cm^{lap} _ \ lambda $和$ cm^{exp} _ \ \ lambda _ 。最后,我们将差异隐私视为投票公理,并讨论其与其他公理的关系。
translated by 谷歌翻译
In time series forecasting, decomposition-based algorithms break aggregate data into meaningful components and are therefore appreciated for their particular advantages in interpretability. Recent algorithms often combine machine learning (hereafter ML) methodology with decomposition to improve prediction accuracy. However, incorporating ML is generally considered to sacrifice interpretability inevitably. In addition, existing hybrid algorithms usually rely on theoretical models with statistical assumptions and focus only on the accuracy of aggregate predictions, and thus suffer from accuracy problems, especially in component estimates. In response to the above issues, this research explores the possibility of improving accuracy without losing interpretability in time series forecasting. We first quantitatively define interpretability for data-driven forecasts and systematically review the existing forecasting algorithms from the perspective of interpretability. Accordingly, we propose the W-R algorithm, a hybrid algorithm that combines decomposition and ML from a novel perspective. Specifically, the W-R algorithm replaces the standard additive combination function with a weighted variant and uses ML to modify the estimates of all components simultaneously. We mathematically analyze the theoretical basis of the algorithm and validate its performance through extensive numerical experiments. In general, the W-R algorithm outperforms all decomposition-based and ML benchmarks. Based on P50_QL, the algorithm relatively improves by 8.76% in accuracy on the practical sales forecasts of JD.com and 77.99% on a public dataset of electricity loads. This research offers an innovative perspective to combine the statistical and ML algorithms, and JD.com has implemented the W-R algorithm to make accurate sales predictions and guide its marketing activities.
translated by 谷歌翻译
大型策划数据集是必要的,但是注释医学图像是一个耗时,费力且昂贵的过程。因此,最近的监督方法着重于利用大量未标记的数据。但是,这样做是一项具有挑战性的任务。为了解决这个问题,我们提出了一种新的3D Cross伪监督(3D-CPS)方法,这是一种基于NNU-NET的半监督网络体系结构,采用交叉伪监督方法。我们设计了一种新的基于NNU-NET的预处理方法,并在推理阶段采用强制间距设置策略来加快推理时间。此外,我们将半监督的损耗重量设置为与每个时期的线性扩展,以防止在早期训练过程中模型从低质量的伪标签中。我们提出的方法在MICCAI Flare2022验证集(20例)上,平均骰子相似系数(DSC)为0.881,平均归一化表面距离(NSD)为0.913。
translated by 谷歌翻译
位置识别在机器人和车辆的重新定位和循环封闭检测任务中起着至关重要的作用。本文为基于激光雷达的位置识别寻求明确定义的全球描述符。与本地描述符相比,全球描述符在城市道路场景中表现出色,但通常依赖于观点。为此,我们提出了一个简单而坚固的全局描述符,称为壁画,通过利用傅立叶变换和圆形转移技术,可以分解重新访问期间的视点差异,并实现翻译和旋转不变性。此外,还提出了一种快速的两阶段姿势估计方法,以利用从场景中提取的紧凑型2D点云来估计位置回收后的相对姿势。实验表明,在来自多个数据集的不同场景的序列上,壁画表现出比同期方法表现出更好的性能。该代码将在https://github.com/soytony/fresco上公开获取。
translated by 谷歌翻译
利用6DOF(自由度)对象的姿势信息及其组件对于对象状态检测任务至关重要。我们展示了IKEA对象状态数据集,该数据集包含宜家家具3D模型,装配过程的RGBD视频,家具部件的6dof姿势及其边界盒。建议的数据集将在https://github.com/mxllmx/ikeaObjectstateTateDataSet上使用。
translated by 谷歌翻译
In this paper, we propose a robust 3D detector, named Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection. Without explicit view transformation, CMT takes the image and point clouds tokens as inputs and directly outputs accurate 3D bounding boxes. The spatial alignment of multi-modal tokens is performed implicitly, by encoding the 3D points into multi-modal features. The core design of CMT is quite simple while its performance is impressive. CMT obtains 73.0% NDS on nuScenes benchmark. Moreover, CMT has a strong robustness even if the LiDAR is missing. Code will be released at https://github.com/junjie18/CMT.
translated by 谷歌翻译
Dataset distillation has emerged as a prominent technique to improve data efficiency when training machine learning models. It encapsulates the knowledge from a large dataset into a smaller synthetic dataset. A model trained on this smaller distilled dataset can attain comparable performance to a model trained on the original training dataset. However, the existing dataset distillation techniques mainly aim at achieving the best trade-off between resource usage efficiency and model utility. The security risks stemming from them have not been explored. This study performs the first backdoor attack against the models trained on the data distilled by dataset distillation models in the image domain. Concretely, we inject triggers into the synthetic data during the distillation procedure rather than during the model training stage, where all previous attacks are performed. We propose two types of backdoor attacks, namely NAIVEATTACK and DOORPING. NAIVEATTACK simply adds triggers to the raw data at the initial distillation phase, while DOORPING iteratively updates the triggers during the entire distillation procedure. We conduct extensive evaluations on multiple datasets, architectures, and dataset distillation techniques. Empirical evaluation shows that NAIVEATTACK achieves decent attack success rate (ASR) scores in some cases, while DOORPING reaches higher ASR scores (close to 1.0) in all cases. Furthermore, we conduct a comprehensive ablation study to analyze the factors that may affect the attack performance. Finally, we evaluate multiple defense mechanisms against our backdoor attacks and show that our attacks can practically circumvent these defense mechanisms.
translated by 谷歌翻译
Automatic music generation with artificial intelligence typically requires a large amount of data which is hard to obtain for many less common genres and musical instruments. To tackle this issue, we present ongoing work and preliminary findings on the possibility for deep models to transfer knowledge from language to music, by finetuning large language models pre-trained on a massive text corpus on only hundreds of MIDI files of drum performances. We show that by doing so, one of the largest, state-of-the-art models (GPT3) is capable of generating reasonable drum grooves, while models that are not pre-trained (Transformer) shows no such ability beyond naive repetition. Evaluating generated music is a challenging task, more so is evaluating drum grooves with little precedence in literature. Hence, we propose a tailored structural evaluation method and analyze drum grooves produced by GPT3 compared to those played by human professionals, exposing the strengths and weaknesses of such generation by language-to-music transfer. Our findings suggest that language-to-music transfer learning with large language models is viable and promising.
translated by 谷歌翻译
Few Shot Instance Segmentation (FSIS) requires models to detect and segment novel classes with limited several support examples. In this work, we explore a simple yet unified solution for FSIS as well as its incremental variants, and introduce a new framework named Reference Twice (RefT) to fully explore the relationship between support/query features based on a Transformer-like framework. Our key insights are two folds: Firstly, with the aid of support masks, we can generate dynamic class centers more appropriately to re-weight query features. Secondly, we find that support object queries have already encoded key factors after base training. In this way, the query features can be enhanced twice from two aspects, i.e., feature-level and instance-level. In particular, we firstly design a mask-based dynamic weighting module to enhance support features and then propose to link object queries for better calibration via cross-attention. After the above steps, the novel classes can be improved significantly over our strong baseline. Additionally, our new framework can be easily extended to incremental FSIS with minor modification. When benchmarking results on the COCO dataset for FSIS, gFSIS, and iFSIS settings, our method achieves a competitive performance compared to existing approaches across different shots, e.g., we boost nAP by noticeable +8.2/+9.4 over the current state-of-the-art FSIS method for 10/30-shot. We further demonstrate the superiority of our approach on Few Shot Object Detection. Code and model will be available.
translated by 谷歌翻译